creating a random*sample for panel data - but should represent several countries

Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#1

creating a random*sample for panel data - but should represent several countries

10 May 2023, 08:24

Hello,

I have a huge data file of unbalanced panel data( around 30 million ID-year observations).
I want to create a random sample but keep all years' observations for a particular company ID.
The main data file includes companies from 13 countries (variable: country_code).

How can I create a random sample (of 100,000 different IDs) but keep all years' observations of a particular ID, and the sample should represent all 13 countries? Do I need to sort the data first?

Code:

sort ID year

How the code would be different if I want a % based sample, for example, a sample of 10% of the full data, representing the 13 countries

here's an example of the variables that are important for the sample

* Example generated by -dataex-. To install: ssc install dataex
clear
input str18 id float year str2 country_code
"IT00832220156" 1999 "IT"
"GB01180514" 2018 "GB"
"SE5560525692" 2020 "SE"
"FR826680084" 2007 "FR"
"IT00180310781" 2015 "IT"
"ESA38403028" 2019 "ES"
"FR418872081" 2006 "FR"
"ESB04105847" 2011 "ES"
"BE0819252201" 2010 "BE"
"SE5562265917" 2016 "SE"
"FR378281935" 2000 "FR"
"ESB81518623" 2009 "ES"
"ESB10343051" 2017 "ES"
"ESB75041228" 2015 "ES"
"IT02564770903" 2019 "IT"
"FI22585062" 2019 "FI"
"IT10607171005" 2016 "IT"
"IT01817990276" 2015 "IT"
"FR381004415" 2011 "FR"
"IT01861600342" 1998 "IT"
"FI17091885" 2012 "FI"
"IT01777250976" 2008 "IT"
"ESB31777808" 2009 "ES"
"SE5569792962" 2020 "SE"
"IT09727470966" 2021 "IT"
"IT03005730738" 2018 "IT"
"SE5563133395" 2011 "SE"
"DK31170427" 2021 "DK"
"FR338466295" 1998 "FR"
"ESB97281521" 2019 "ES"
"ESB15665748" 2005 "ES"
"SE5564657269" 2013 "SE"
"IT01930040561" 2013 "IT"
"ESB04251955" 2013 "ES"
"ESB14361810" 2003 "ES"
"ESB12742821" 2011 "ES"
end
[/CODE]

thank you for your help
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30358

10 May 2023, 09:28

Code:

frame put id, into(ids)
frame ids {
    duplicates drop
    set seed 1234 // OR WHATEVER SEED NUMBER YOU LIKE
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()
    by shuffle1 shuffle2, sort: keep if _n <= 100000
}
frlink m:1 id, frame(ids)
keep if !missing(ids)
drop ids
frame drop ids

Comment

Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#3

11 May 2023, 02:33

Hello Clyde,

Thank you for the provided code, but unfortunately, didn't work with me.
I received an error in Stata

command frame is unrecognized

, I think the command

Code:

frame

doesn't work in Stata 14

I also noticed that the code doesn't include the variable country_code. because the number of companies in each country is very different, some countries such as Germany and UK would have much more companies than other countries in my sample. Is it possible a random sample ,but take into account the unequal number of companies in different countries?

Thank you for your help
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#4

11 May 2023, 09:18

All Forum members are asked to read the Forum FAQ before posting. Had you done so, you would have read that if you are not using the current version of Stata (18) you are supposed to state what version you are using. That would have prevented both of us from wasting our time--me writing, and you trying to execute, code that is not possible for you.

You raise two issues. The issue of Stata 14 not supporting -frame- commands has an easy workaround:

Code:

preserve keep id duplicates drop set seed 1234 gen double shuffle1 = runiform() gen double shuffle2 = runiform() sort shuffle1 shuffle2 keep if _n <= 100000 drop shuffle* tempfile selections save `selections' restore merge m:1 id using `selections', keep(match) nogenerate

The other issue, that some small countries might fail to be represented, is not so simple. You are, as I interpret it, looking to oversample small countries to assure that they are represented in the random sample. That can be done, but the only other constraint you have proposed is that the total sample size be 100,000. That constraint fails to narrow down the problem sufficiently. You will need to say how you want to upweight the small countries. One possibility is to say that you want to sample some fixed percentage of the id's in each country. Since you are looking for a sample size of around 100,000 and you are starting with 30,000,000, that means you will want, on average, about 1 in each 300 observations, or 0.33%. So you could redefine the problem as choosing a 1 in 300 random sample, stratifying on each country, with a limitation that at least one id must be chosen from each country. The exact sample size may not turn out to be 100,000, but it won't be far off, and it will guarantee each country's inclusion.

That would look like this:

Code:

preserve keep country id duplicates drop set seed 1234 gen double shuffle1 = runiform() gen double shuffle2 = runiform() by country (shuffle1 shuffle2), sort: keep if _n <= ceil(_N/300) drop shuffle* tempfile selections save `selections' restore merge m:1 id using `selections', keep(match) nogenerate

If this approach doesn't work for you, post back with more specific sampling criteria.

In any case, bear in mind that by oversampling the small countries, you will need to then downweight them in the analysis if your research questions involve obtaining results that are sensitive to between-country differences.
Comment
Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#5

14 May 2023, 07:07

Dear Clyde Schechter ,
Apologies for not mentioning the Stata version, won't happen again.

Thank you for the code provided and the explanation. I used the second code and it worked until the last step (merging)

when using this code

Code:

merge m:1 id using `selections', keep(match) nogenerate

I received the error message

variable id does not uniquely identify observations in the using data

which i find very weird, how there are not uniquely identified "id" id we create the sample from this file originally.

I also tried saving the file not as a temp file , and used this

Code:

clear use 'original file.dta" merge m:1 using "selections" ,keep(match) nongenerate

and also received the same error.
what could possibly go wrong? can't really know

I appreciate your help

Many thanks
Comment
Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#6

14 May 2023, 09:19

I figured that there were duplicates in the 'selections' file, don't know why, dropped the duplicates then the merge worked.

thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#7

23 May 2023, 17:52

Yes, there were duplicates in the `selections' file, but dropping them is not the correct solution. You got a result, but if it is correct, it is just luck. My code in #4 contained an error. The -merge- command should have been:

Code:

merge m:1 country id using `selections', keep(match) nogenerate

I apologize for the error.
Comment

Announcement

creating a random*sample for panel data - but should represent several countries

Comment

Comment

Comment

Comment

Comment

Comment